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Abstract 

We study the classical approximate string matching problem, that is, given strings P and Q and an 
error threshold k, find all ending positions of substrings of Q whose edit distance to P is at most k. 
Let P and Q have lengths m and n, respectively. On a standard unit-cost word RAM with word size 
w > log n we present an algorithm using time 
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When P is short, namely, m = 2°'^'°^"' or m = 2°'^\/^l log™) ^jjjg improves the previously best known 
time bounds for the problem. The result is achieved using a novel implementation of the Landau- Vishkin 
algorithm based on tabulation and word-level parallelism. 

1 Introduction 

Given strings P and Q and an error threshold k, the approximate string matching problem is to report all 
ending positions of substrings of Q whose edit distance to P is at most k. The edit distance between two 
strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to 
the other. Approximate string matching is a classical and well-studied problem in combinatorial pattern 
matching with a wide range of applications within areas such as bioinfomatics, network traffic analysis, and 
information retrieval. 

Let m and n be the lengths of P and Q, respectively, and assume w.l.o.g. that k < m < n. The classic 
textbook solution to the problem, due to Sellers [26], fills in an (m + 1) x {n + 1) distance matrix C such 
that Ci_j is the smallest edit distance between the prefix i of P and a substring of Q ending at position j. 
Using dynamic programming each entry in C can be computed in constant time leading to an algorithm 
using time 0{nm). 

Several improvement of this algorithm are known. Masek and Paterson [21] showed how to compactly 
encode and tabulate solutions to small submatrices of the distance matrix. Multiple entries in the table can 
then be traversed in constant time leading to an algorithm using 0(nm/ log^ n + n) time. This tabulation 
technique is often referred to as the Four Russian technique after Arlazarov et al. [4] who applied it to 
boolean matrix multiplication. Alternatively, several algorithms using the arithmetic and logical operations 
of the word RAM to simulate the dynamic program have been suggested [5,6,17,23,29,30]. This technique 
is often referred to as word-level parallelism or bit-parallelism. The best known bound is due to Myers [23] 
who gave an algorithm using 0(nm/w -\- n) time. In terms of n and m alone these are the best known 
bounds for approximate string matching. However, if we take into account the error threshold k several 
faster algorithms are known [9,12,13,19,22,25,27,28]. These algorithms exploit properties of the diagonals 
of the distance matrix C and are therefore often called diagonal transition algorithms. The best known 
bound is due to Landau and Vishkin [19] who gave an 0{nk) algorithm. Compared to the algorithms by 
Masek and Paterson and Myers the Landau- Vishkin algorithm (abbreviated LV-algorithm) is faster for most 
values of fc, namely, whenever k = o(m/ log^n) or fc = o{m/w). For k = 0(to^/^) Cole and Hariharan 
showed that it is even possible to solve the approximate string matching in 0{n) time by "filtering" out all 
but a small set of positions in Q which are then checked using the LV-algorithm. 
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All of the above bounds are valid on a unit-cost RAM with w-bit words and a standard instruction set 
including arithmetic operations, bitwise boolean operations, and shifts. Each word is capable of holding a 
character of Q and hence w > logn. The space complexity is the number of words used by the algorithm, 
not counting the input which is assumed to be read-only. For simplicity, we assume that suffix trees can be 
constructed in linear time, which is true for any polynomially sized alphabet [11]. This assumption is also 
needed to achieve the 0{nk) bound of the Landau- Vishkin algorithm [19]. All the results presented here 
assume the same model. 

1.1 Results 

We present a new algorithm for approximate string matching achieving the following bounds. 

Theorem 1 Approximate string matching for strings P and Q of length m and n, respectively, with error 
threshold k, can be solved 

(i) in time 0{nk ■ ''j'^^ ™ -I- n) and space 0{n'^ + m), for any constant e > 0, and 

(ii) in time 0{nk ■ "iiogw _|_ space 0{m). 

When P is short, namely, m — 2°(^'°s") or m = 1°'^\J'^I log"") improves the 0{nk) time bound and places 
a new upper bound on approximate string matching. For many practically relevant combinations of n, m 
and k this significantly improves the previous results. For instance, when m is polylogarithmic in n, that is, 
TO = O(log'^rt) for a constant c > 0, Theorem [Iji) gives us an algorithm using time 0{nk ■ -t- n) 

which is an almost logarithmic speed-up of 0{ (iog°o^„)2 ) over the 0(nk) bound. Note that the exponent c 
in the length of the pattern only affects the constants in the time bound of our algorithm. For larger m the 
speed-up smoothly decreases until to = 2®(^'°s") when we arrive at the 0{nk) bound. 

The algorithm for Theorem [Iji) tabulates certain functions on elogn bits which lead to the additional 
0(2''°s") = O(n^) space. The algorithm for Theoreml^ii) instead uses word-level parallelism and therefore 
avoids the additional space for lookup tables. Furthermore, for w — O(logn) Theorem [Ijii) gives us an 
algorithm using time 0{nk ■ '"'^ -^^'"s'QS" + n) which is a factor O (log logn) slower than Theorem [Iji) . 
However, the bound increases with w and whenever wlogw — w(logn). Theorem [Iji) is the best time 
bound. 

1.2 Techniques 

The key idea to obtain our bounds is a novel implementation of the LV-algorithm, that reduces approximate 
string matching to 2 operations on a compact encoding of the "state" of the LV-algorithm. We then show how 
to implement these operations using tabulation for Theorem [TJi) or word- level parallelism for Theorem [T](ii). 
Eventhough tabulation and word-level parallelism are well studied techniques for speeding up approximate 
string matching based on Sellers classical dynamic programming algorithm [26], no similar improvement of 
diagonal transition algorithms are known. Achieving this is also mentioned as an open problem in a recent 
survey by Navarro [24, p. 61]. The main problem is the complicated dependencies in the computation of 
the LV-algorithm. In particular, in each step of the LV-algorithm we map entries in the distance matrix to 
nodes in the suffix tree, answer a nearest common ancestor query, and map information associated with the 
resulting node back to an entry in distance matrix. To efficiently compute this information in parallel we 
introduce several new techniques. These techniques differ significantly from the techniques used to speed-up 
Sellers algorithm and we believe that some of them might be of independent interest. For example, we give 
a new algorithm to efficiently evaluate a compact representation of a function on several inputs in parallel. 
We also show how to use a recent distributed nearest common ancestor data structure to efficiently answer 
multiple nearest common ancestor queries in parallel. 

The results presented in this paper are mainly of theoretical interest. However, we believe that some 
of the ideas have practical relevance. For instance, it is often reported that the nearest common ancestor 
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computations make the LV-algorithm unsuited for practical purposes [24] . With our new algorithm we can 
compute several of these in parallel and thus target this bottleneck. 



1.3 Outline 



The paper is organized as follows. In Section [2] we review the basic concepts and the LV-algorithm. In 
Section [3] we introduce the packed representation and the key operations needed to manipulate it. In 
Section [4.2l we reduce approximate string matching to two operations on the packed representation. Finally, 
in Sections and Section [S] we present our tabulation based algorithm and word- level parallel algorithm for 
these operations. 

2 Preliminaries 

We review the necessary concepts and the basic algorithms for approximate string matching which we will 
use as a starting point for our own algorithms. 

2.1 Strings, Trees, and SuiRx Trees 

Let S* be a string of length |5| on an alphabet E. The character at position i in S* is denoted S[i] and the 
substring from position i to j is denoted by S'[z, j]. The substrings S'[l, j] and S[i, \S\] are the prefixes and 
suffixes of S, respectively. The longest common prefix of two strings is the common prefix of maximum 
length. 

Let T be a rooted tree with \T\ nodes. A node in T is an ancestor of a node w if w is on the path 
from the root to w (including v itself) . A node z is a common ancestor of nodes v and w if z is an ancestor 
of both. The nearest common ancestor of v and w, denoted nca(w, w), is the common ancestor of v and w 
of greatest depth in T. With linear space and preprocessing time we can answer nca queries in constant 
time [16] (see also [2,8]). 

The suffix tree for S, denoted Ts, is the compacted trie storing all sufhxes of S [14]. Each edge is a 
labeled by some substring of S called the edge-label, and the concatenation of edge-labels on a path from 
the root to a node v is called the path-label of v. The string-depth of v, denoted strdepth(w), is the length of 
the path-label of v. The ith suffix of S is represented by the unique leaf in Ts whose path-label is S[i, \ S\] 
and we denote this leaf by leaf (i). The suffix tree uses linear space and can be constructed in linear time for 
polynomially sized alphabets [11]. 

A useful property of suffix trees is that for any two leaves leaf(i) and leaf (j) the path label of the node 
nca(leaf (i), leaf (j')) is longest common prefix of the sufhxes S[i, \S\] and S[j, \S\] [14]. Hence, if we construct 
a nearest common ancestor data structure for Tg and compute the string depth for each node in Tg we can 
compute the length of the longest common prefix of any two sufhxes in constant time. 

For a set of strings Si, . . . , Si it is straightforward to construct a sufhx tree Ts-^ s, storing all suffixes 

of each string in ^i, . . . ,Si [14]. The space for Tsi^...,Si is linear in the total length of strings. 

2.2 Algorithms for Approximate String Matching 

Recall that \P\ = m and \Q\ — n and k is the error threshold. The algorithm by Sellers [26] fills in a 
(n -|- 1) X (m + 1) matrix C according to the following rules: 



(The rules are the same as those for the dynamic programming algorithm edit distance problem except for 
the boundary condition Coj = 0). The entry dj is the minimum edit distance between P[l,i] and any 




min(i:)i-ij-i + (5(pi, ij), A-ij + 1, Aj-i + 1) 







(1) 
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substring ending at Q[j], and hence there is a match of P with a most k edits that ends at Q[j] iff Cmj < k. 
Using dynamic programming we can compute each entry in constant time leading to an 0{nm) solution. 

Landau and Vishkin [19] presented a faster algorithm to compute essentially the same information as 
in which we call the LV-algorithm in the rest of the paper. Define diagonal d of C to be the set of entries 
Cij such that j — i = d. Given a diagonal d and integer e, define the diagonal position Ld,e to be the largest 
row i such that Cij = e and Cij is on diagonal d. There is a match of P with a most k edits that ends at 
Q[d + m] iff Ld.e = w for some e < k. Let \cp{i,j) denote the length of the longest common prefix of P and 
Q. Using the clever observation that entries in a diagonal are non-decreasing in the downwards direction, 
Landau and Vishkin gave the following rules to compute L^^e- 

Ld.-i = Ln+i.e = ^1, for e G { — 1, . . . , fc} and d € {0, . . . , n} 
Ld.\d\-2 = \d\ - 2, for d e {-{k + 1), . . . , -1} 

Ld.M-i = \d\ - 1, for d e {-(fc + 1), . . . , -1} (2) 

Ld,e = z + lcp(z + l,d + z + 1), 
where z = min(TO, max(Ld_e-i + 1, ^d-i.e-i, ^d+i,e-i + 1)) 

Lines 1 — 3 are boundary conditions. Line 4 — 5 determines Ld.e from Ld.e-i, -^d-i.e-i, and Ld+i,e-i and the 
length of the longest common prefix of P[z + 1, m] and Q[d + z + 1]. Thus we can fill in the entries of L by 
iteratively by computing the sets of diagonal positions L-i, Lq, . . . , L^, where L^. denotes the set of entries 
in L with error e. Since we can compute Icp queries in constant time using the data structure from the 
previous section the total time to fill in the 0{nk) entries of L is 0{nk). Each set of diagonal positions and 
the sufhx tree require 0{n) space. However, we can always divide Q into overlapping substrings of length 
2m — 2k with adjacent substrings overlapping in m + fc — 1 characters. A substring matching P with at most 
k errors must have a length in the range [m — k,m + k] and therefore all matches are completely contained 
within a substring. Applying the LV-algorithm to each of the substrings independently solves approximate 
string matching in time 0{n/m ■ mk) — 0{nk) as before, however, now the space is only 0{'m). 



3 Manipulating Bits 

In this section we introduce the necessary notation and key primitives for manipulating bit strings. 

Let X = bi . . . bi he a bit string consisting of bits 6i, . . . , 6; numbered from right-to- left. The length of 
X, denoted is I. We use exponentiation for bit repetition, i.e., O'^l — 0001 and • for concatenation, i.e., 
001 • 100 ~ 001100. In addition to the arithmetic operators — , and x we have the operators & , |, and 
denoting bit-wise 'and', 'or', and 'exclusive-or'. Moreover, x is the bit-wise 'not' of x and x <^ j and x ^ j 
denote standard left and right shift by j positions. The word RAM supports all of these above operators for 
bit strings stored in single words in unit time. Note that for bit strings of length 0{w) we can still simulate 
these instructions in constant time. 

We will use the following nearest common ancestor data structure based on bit string labels in our 
algorithms. 

Theorem 2 (Alstrup et al. [2]) There is a linear time algorithm that labels the t nodes of a tree T with 
bit strings oj length O(logi) bits such that from the labels of nodes v and w in T alone, one can compute the 
label o/nca(u,w) in constant time. 

For our purposes we will slightly modify the above labeling scheme such that all labels have the same length 
/ = O(logt). This is straightforward to do and we will present one way to do it later in Section [6.4.11 Let 
label(w) denote the label of a node v in T. The label nearest common ancestor, denoted Inca, is the function 
given by lnca(label('!;), label(w)) = label(nca(u, w)) for any two labels label(t;) and label(u') of nodes v and 
w in T. Thus, Inca maps two bit strings of length / to a single bit string of length /. 
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3.1 Packed Sequences 

We often interpret bit strings as sequences of smaller bit strings and integers. For a sequence Xi,. . . ,Xr of 
bit strings of length / define the f -packed sequence X = {xi,. . . , x^) to be the bit string 



Hence \X\ = r(f + 1). Each substring • Xj, 1 < i < r, is a field. The leftmost bit of a field is the test bit 
and the remaining / bits, denoted X{i) = Xi, is the entry. The length of X is r. If Xi, . . . , is a sequence 
of /-bit integers (xi, . . . , Xr) is interpreted as (bin(a;i), . . . , hm{xr)), where bin(a;) is the binary encoding of 
X. We represent packed sequences compactly in words by storing s = [r{f + l)/w\ fields per word. For our 
purposes we will always assume that fields are capable of storing the total number of fields in the packed 
sequence, that is, / > logr. Given another /-packed sequence Y = {yi, . . . ,yr), the zip of X and Y, denoted 
X I y is the 2/-packed sequence ((xi, j/i), . . . , (x^, 2/r)) (the tuple notation {xi,yi) denotes the bit string 
Xi ■ Vi). Packed sequence representations are well-known within sorting and data structures (see e.g., the 
survey by Hagerup [15]). In the following we review some basic operations on them. 

Let X = (xi, . . . , Xs) and Y = . . . , t/g) be /-packed sequences stored in single words. We consider 
the general case of packed sequences using multiple words later. Some of our operation require precomputed 
constant depending on s and /, which we assume are available (e.g., computed at "compile-time"). If this 
is not the case we can always precompute all constants in time 0(log'^^^-' w) which is neglible. 

Elementwise arithmetic operations (modulo 2^) and bit- wise operations are straightforward to implement 
in 0(1) time using the built-in operations. For instance, to compute (xi + t/i mod 2^ , . . . ,Xs + ys mod 2-f) 
we add X and Y and clear the test bits by & 'ing with the constant Igj = (10-^)^ consisting of I's at all test 
bit positions in a word. The test bit position ensure that no overflow bits can affect neighbouring entries. 

The compare of X and Y wrt. an operator [xjg {=,7^,>,<} is the bit string C where all entries are 
and the ith test bit is 1 iff x^ ixa y^. For the > operator we compute the compare as follows. Set the test bits 
of X by I 'ing with Is.f, then subtract Y, and mask out the test bits by & 'ing with Igj. It is straightforward 
to show that the ith test bit in the result "survives" the subtraction iff Xi > yi. The entire operation takes 
0(1) time. The other operators =, 7^, and < are computed similarly in 0(1) time. 

Given a sequence of test bits ti, . . . , stored at test bit position in a bit string T, i.e., T = ts-0-^ ■ ■ ■ tiO^ , 
the extract of X wrt. T is the /-packed sequence E given by 



To compute the extract operation copy each test bit to all positions in their field by subtracting (Isj ^ /) 
from T and then & the result with X. This takes 0(1) time. We can combine the compare and extract 
operation to compute more complicated operations. For instance, to compute the elementwise maximum 
M = (max(xi,yi), . . . ,ma.x{xs,ys)) compare X and Y wrt. > and let T be the result. Extract from X 
wrt. T the packed sequence Mx containing all entries in X that are greater than or equal the corresponding 
entry in Y. Also, extract from Y wrt. T k,Irj the packed sequence My containing all entries in Y that are 
greater than or equal the corresponding entry in X. Finally, combine Mx and My into M by j'ing them. 

Let 2 be a bit string of length /. The rank of z in X, denoted by rank(X, z), is the number of entries 
in X smaller than or equal to z. We can compute rank(X, z) in constant time as follows. First replicate 
z to all fields in a words by computing Z = z x 1(0-' 1)" = (z, . . . , z) and then compare X and Z wrt. >. 
The number of 1 bits in the resulting word in C is rank(X, z). To count these we compute the suffix sum of 
the test bits by multiplying C with (O-^'l)* producing a word P, such that P{i) is number of test bits in the 
rightmost i field of C. Finally, extract the result P{s) as the results. Note that the condition / > logr is 
needed here. 

All of the above constant time algorithms, except for rank, are straightforward to generalize to general 
/-packed sequences stored in multiple words by applying the algorithm to the words in the sequence. The 
time for the operations is then 0{r/s + 1) = 0{rf/w + 1). 



• Xr • • Xr-1 • • • • X2 • • Xi 




Xi if = 1, 
otherwise. 
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We will also need more sophisticated packed sequence operations. First, define an f -packed function to 
be a 2/-packed sequence G = {{zi, g{zi)), . . . , {zu, g{zu))) of length u such that zi < • • • < z„ and g is any 
function mapping bit string of length / to bit strings of length /. The domain of G, denoted dom(G'), is 
the sequence (zi, . . . , z„). Let X = (xi, . . . , Xr) and Y = (j/i, . . . , y^) be /-packed sequences and let G be a 
/-packed function such that each entry in X appears in dom(G). Define the following operations. 

Map(G, X) : Return the /-packed sequence (^(xi), . . . , g{xr))- 

Lnca(X, Y) : Return the /-packed sequence (lnca(a;i, j/i), . . . , lnca(xr, Ur))- 

In other words, the Map operation applies g to each entry in X and Lnca is the elementwise version of the 
Inca operation. We believe that algorithm for these operations might be of independent interest in other 
applications. In particular, the Map operation appears to be a very useful primitive for algorithm using 
packed sequences. Before presenting our algorithms for Map and Lnca we show how they can used to 
implement the LV-algorithm. 

4 From Landau-Vishkin to Mapping and Label Nearest Common 
Ancestor 

In this section we give an implementation of the LV-algorithm based on the Map and Lnca operations. Let 
P and Q be strings of length m and 2m — 2k and k be an error threshold. Recall from Section [2] that an 
algorithm for this case immediately generalizes to find approximate matches in longer strings. We preprocess 
P and Q and then use the constructed data structures to efficiently implement the LV-algorithm. 

4.1 Preprocessing 

We compute the following information. Let r = 0(rn) be the number of diagonals in the LV-algorithm on P 
and Q. 

• The suffix tree TpQ consisting of 0(m) leaves and 0(m) nodes in total. The leaf in T representing 
suffix i of P and suffix i Q is denoted leaf (P, i) and leaf (Q, i), respectively. 

• Nearest common ancestor labels for the nodes in Ppg with maximum label length / — O (log to). The 
label for a node v is denoted label(f ). 

• The /-packed functions Np, Nq, and D, representing the functions given by np{i) — label(leaf (P, i)), 

for i e {1, . . . , m}, n^{i) = label (leaf (Q, i)), for i £ {1, . . . , 2to — 2k}, and d{la.hel{v)) = strdepth(w) 
for any node v in TpQ. 

• The /-packed sequences 1^./ and Mrj consisting of r copies of 1 and m, respectively, and the /-packed 
sequence Jrj = (1, 2, . . . , r). 

Since r = 0{'m) the space and preprocessing time for all of the above information is 0{m). 

4.2 A Packed Landau-Vishkin Algorithm 

To implement the LV-algorithm we represent each of the sets of diagonal positions . . . , as /-packed 
sequences of length r. We construct L_i by inserting each field in constant time according to and after 
computing Lfe we inspect each field in constant time and report any matches. These steps take 0{r) = 0{m) 
time in total. We show how to compute the remaining sets of diagonal positions. Let L — L^-i for some 
e G {0, . . . , A; — 1}. The following algorithm computes L' = L^+i- Fill in the 0(1) boundary fields according 
to ^ and compute the remaining fields using the following 4 simple steps. 
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step 1: Compute Maximum Diagonal Positions Compute the /-packed sequence Z given by 

Z(i) := min(max(L(i - 1) + 1, L{i) + 1, L{i + 1) + 1)) 

Thus, Z corresponds to the "z" part in line 5 of ([2])). To compute Z construct packed sequences Zi, Z2, 
and Z3 given by Zi{i} L{i - 1) + 1, ^2(4) := L{i} + 1, and ^3(1) L{i + 1) + 1 by shifting L one field 
to the right, left, or not at all, and adding Ir.f- Compute the elementwise maximum of Zi, Z2, and Z3 and 
then take the elementwise minimum with Mrj to get Z. 

Step 2: Translate to Suffixes Compute the /-packed sequences Zp and Zq given by 

Zp{i) := ZiS) + 1 
Z^^i) := Z{i) + i + m. 

Hence, Zp{i) and ZQ{i) contains the inputs to the Icp part in line 4 of ([2]). To compute Zp and Z^ add to 
Z the sequences l^j and J^./ -I- M^j, respectively. 

Step 3: Compute Longest Common Prefixes Compute the /-packed sequence LCP given by 

LCP := Maf{D, Lnca(Map(A^p, Zp), Map(A^^, Zq))) 
This corresponds to the computation of Icp in line 4 of ((21) . 

Step 4: Update State Finally, compute the new sequence L' of diagonal positions as the packed sequence 
L' given by 

L'{i) = +LCP(i). 

This is the -I- in fine 4 of (g]). 

Steps 1, 2, and 4 takes 0{rf/w) = O {m log m/w) time. Note that a set of diagonal positions of LV- 
algorithm requires 0(m log m) bits to be represented and therefore this bound is optimal within constant 
factors. We parameterize the complexity for approximate string matching in terms of the complexity for the 
Lnca and Map operations. 

Lemma 1 Let P and Q he strings of length m and n, respectively, and let k be an error threshold. Given a 
data structure using s space and and p preprocessing time that supports MAP and Lnca in time q on inputs 

of length 0{m) we can solve approximate string matching in time O ■ q-\- ""^ +p + n^ and space 

0[s + m). 

Proof. We consider two cases depending on n. Suppose first that n < 2m — 2k. In this case we can 
implement the above algorithm using the data structure for Map and Lnca since all of packed sequences 
have at most 0{m) entries. The preprocessing time and space is 0(m +p). Steps 1, 2, and 4 use time 
0[rf jvj + 1) = OirnXogmjw + 1) and step 3 use time q. Computing all of the 0(fc) state transitions 
therefore takes time 0{k{q + *" ) + p + m). Since n/m — 0{1) the result follows. If n > 2m — 2k we 
apply the algorithm to 0{n/m) substrings of length 2m — 2k as described in Section [51 The computation 
for each of the substrings is independent and therefore we can reuse the space to get 0{p -I- m) space. The 
time is 

„ f n , f m log m\ \ „ f nk nk log m \ 

0{ — - k- { q+ — ] + p + n] = 0{ q+ — +p + n] . 

\m \ w J J \m w J 

□ 
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5 Implementing Lnca and Map 



In this section and the foUowing section we show how to implement the Lnca and Map operation efhciently. 
For simphcity, we will initially assume that our word RAM supports a constant number of non-standard 
instructions. A non-standard instruction implements a given function taking 0(1) words to 0(1) words. We 
will subsequently implement the non-standard instruction using either tabulation or word-level parallelism 
leading to the two parts of Theorem [1] We emphasize that the main result in Theorem [1] only uses standard 
instructions. 

To implement Lnca we will simply assume that Lnca is itself available as a non-standard instruction, 
i.e., given two /-packed sequences X and Y stored in single words we can compute Lnca(X, y) in constant 
time. Since Lnca is an elementwise operation this immediately implies a result for general packed sequences: 

Lemma 2 Let X and Y be f -packed sequence of length r. With a non-standard Lnca instruction we can 
compute Lnca(X, y) in time + 1). 

Proof. Simply compute the ith word of Lnca(X, Y) in constant time from the ith word of X and Y. This 
takes 0{\r{f + 1)/H) = 0{rf/w + 1) time. □ 

The output words of the Map operation may depend on many words of the input and a fast way to 
collect the needed information is therefore required. We achieve this with a number of auxiliary operations. 
Let X and Y and be /-packed sequences of length and let G be an /-packed function also of length r. Define 

Zip{X, Y) : Return the 2/-packed sequence X ^Y. 

Unzip(X I Y) : Return X and Y. This is the reverse of the Zip operation. 

Merge(X, y) : For sorted X and Y, return the sorted /-packed sequence of the 2r entries in X and Y. 
Sort(X) : Return the /-packed sequence of the sorted entries in X. 
Map^(G,X) : For sorted X, return Map(G,X). 

With these available as non-standard instruction we obtain the following result for /-packed sequence stored 
in multiple words. 

Lemma 3 Let X and Y be f -packed sequences of length r and let G be an f -packed function of length u. 
With Zip, Unzip, Merge, Sort and Map^ available as a non-standard instructions we can compute 

(i) Zip(X, Y), Unzip(X X Y), and Merge(X) in time + 1), 

(a) Sort(X) in time 0(^logr + 1), and 
(lii) Map^(G,X) in time 0{^^-^ + 1). 

Proof. Let s = [r( / + l)/w\ denote the number of fields in a word. 

(i) We implement Zip and Unzip one word at the time as in the algorithm for Lnca. This takes time 
0{rf /w -\- 1). To implement Merge we simulate the the standard merge algorithm. First impose a total 
ordering on the entries in X and Y by Zip'ing them with J2r,f = (l,---,2r) thus increasing the fields 
of X and Y to 2/ bits (if J2rj is not available we can always produce any word of it constant time by 
determining the leftmost entry of the word, replicating it to all positions, and adding the constant word 
Jsj = (l,...,s)). We compute MERGE(Ar, F) in 0{r/s) iterations starting with the smallest fields in X 
and Y. In each iteration we extract the next s fields of X and Y, Merge them using the non-standard 
instruction, and concatenate the smallest s fields Z = (zi, . . . , Zs) of the resulting sequence of length 2s to 
the output. We then skip over the next rank(X, Zg) fields of X and rank(y, Zg) fields of Y and continue to 
the next iteration. The total ordering ensures that precisely the output entries in Z are skipped in X and 
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Y. Finally, we Unzip the / rightmost bits of each field to get the final result. To compute rank we only 
need to look at the next s fields of X and Y and hence each iteration takes constant time. In total we use 
time 0{rf /w + 1). 

(ii) We simulate the merge-sort algorithm. First sort each of word in X using the non-standard Sort 
instruction. This takes 0{r / s) time. Starting with subsequences of length Z = s we repeatedly merge pairs of 
consecutive subsequences into sequences of length 21 using (i). After 0(log(r/s)) levels of recursion we are left 
with a sorted sequence. Each level takes 0{r/s + l) time and hence the total time is O ( ^ • log ^ ) = O ( ^ log r) . 

(iii) We implement Map^ (G, X) as follows. Let Gi, . . . , G<^u/s\ be the words of G. We first partition X 
into maximum length subsequences Xi, . . . , X^^^/g-^ such that all entries of Xi appear in dom(Gi). We do so 
in \u/ s\ iteration starting with the smallest field X. Let gi denote the largest field in Gi. In iteration i we 
repeatedly extract the next word from X and compare the largest field of the word with gi to identify the 
word of X containing the end of Xi. Let Z = {zi, . . . , Zg) be this word. We find the end of Xi by in Z by 
computing h = rank(Z, 17^). We concatenate each of the words extracted and the h first fields of Z to form 
Xi and proceed to the next iteration. This takes 0{{r + u)/s + 1) time. 

Next, we compute for i = l,...,[u/s] the /-packed sequences MAP^(Gi, X^) by applying the non- 
standard Map^ instruction to each word in Xi. Since each entry in Xi appears in Gi and Xi is sorted 
this takes constant time for each word in Xi. Finally, we concatenate the resulting sequences into the final 
result. The total number of words in Xi, . . . ,X|-„/s^ is 0{{r -\- u)/s + 1) and hence the total time is also 
0{{r + u)/s + l). □ 

With the operations from Lemma [3] we can now compute Map(G,X) as the sequence M2 obtained as 
follows. Let Jrj = (1, . . . , r). 

(Zi,Z2) Unzip(Sort(Zip(X, J^j)) 
A := Map^(G, Zi) 
(Mi,M2) := Unzip(Sort(Zip(Z2,A))) 



We claim that M2 — Map(G,X). Since X is represented in the / leftmost bits of Zip(X, J^./) — {{xi,l), ... ,{xr,r)) 
we have that Sort(Zip(X, J^,/)) is a 2/-packed sequence {{xi^,ii), . . . , {xi^,ir)) such that a;^^ < ■ ■ ■ < Xi^. 
Therefore, A = Map^(G,Zi) = (.9(2;, J, . . . , g(a;,J) and hence Zip(Z2,A) = ((n, ^(x.J), . . . , (v, ^(x^J)). 
It follows that Sort(Zip(Z2, A)) ({l,g{xi)), . . .,{r,g{xr))) implying that M2 = Map(G,X). 
We obtain the following result. 

Lemma 4 Let X be a f -packed sequence of length r and let G be a f -packed function of length u such that 
all entries in X appear in dom(G). With Zip, Unzip, Merge, Sort and Map^ available as a non-standard 
word instructions we can compute Map(G,X) in time 0(-^^-^^^ + ^logr-|- 1). 

Proof. The above algorithm requires 2 Sort, Zip, and Unzip operations on packed sequences of length r 
and a Map^ operation on a packed function of length u and a packed sequence of length r. By Lemma [3] 
and the observation from the proof of Lemma [3I^i) that we can compute J^./ in constant time per word we 
compute Map(G,X) in time Ol^^-t^ -f- ^ logr -f 1). □ 

By a standard tabulation of the non-standard instructions we obtain algorithms for Lnca and Map 
which in turn provides us with Theorem [Iji) . 

Theorem 3 Approximate string matching for strings P and Q of length m and n, respectively, with error 
threshold k, can be solved in time 0{nk ■ ™ -I- n) and space 0(n' -|- m), for any constant e > 0. 

Proof. Modify the /-packed sequence representation to only fill up the b = Slogn leftmost bits of each 
words, for some constant 6 > 0. Implement the standard operations in all our packed sequence algorithms 
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as before and for the non-standard instructions Lnca, Zip, Unzip, Sort, Merge, and Map"^ construct a 
fookup tables indexed by the inputs to the operation and storing the corresponding output. Each of the 2'^'^''^ 
entries the fookup tabfos stores 0{b) — 0{'w) bits and therefore the space for the tabfos is 2'^('') = n°^^^ . It 
is straightforward to compute each entry in time polynomial in b and therefore the total preprocessing time 
is also 2'-"'''^ b'-"^^^ — nP'^^\ For any constant e > we can choose 5 such that the total preprocessing time 
and space is 0{n'^). 

We can now implement Lnca and Map according to Lemma [2] and |4] with w ~ b = 0{logn) without 
the need for non-standard instruction in time 0{-y^^ + 1) and 0{ + logr -I- 1), respectively. We 

plug this into the reduction of Lemma [1] We have that r,u — 0(m) and / = O(logTO) and therefore 
1 = 0( '''i^g"l^ + logr -I- 1) = Q C"'"!^^" + 1). Since s = p = 0(n'=), we obtain an algorithm for approxi- 
mate string matching using space 0(n^ -t- m) and time • + n) = 0{nk ■ ^-^-^ +n). □ 



6 Exploiting Word-Level Parallelism 

For part (ii) of Theorem[l]we implement each of the non-standard instructions Zip, Unzip, Sort, Merge, 
Map^, and Lnca using only the standard arithmetic and bitwise instruction of the word RAM. This allows 
us to take full advantage of long word lengths and also gives us a more space-efficient algorithm than the 
one from the previous section since no lookup tables are needed. In the following sections we present 
algorithms for each of the non-standard instructions and use these to derive efficient algorithms for the 
/-packed sequence operations. The results for Zip, Unzip and Merge are well-known and the result for 
Sort is a simple extension of Merge. The results for Map^ and Lnca are new. Throughout this section 
let s = + 1)J denote the number of fields in a word, and assume w.l.o.g. that s is a power of 2. 

6.1 Zipping and Unzipping 

We present an O(logs) algorithm for the Zip instruction based the following recursive algorithm. Let 
X — (xi, . . . , Xs) and Y — (yi, . . . , ys) be /-packed sequences. If s = 1 return xi ■ yi and otherwise compute 
recursively the packed sequence 

{{xs/2+i,- {ys/2+1, ■ ■ ■,ys)) ■ {{xi, . . ■,Xs/2) t (yi, ■ ■ ■,ys/2)) ■ 

It is straightforward to verify that the returned sequence is X '\.Y. We implement each level of the recursion 
in parallel. Let Z = Y ■ X — {xi, . . . ,Xs,yi, ■ ■ ■ ,ys)- The algorithms work in logs steps, where each step 
corresponds to a recursion level. At step i, i — 1, . . . , logs, Z consists of 2* subsequences of length 2^°ss-*-i-i 
stored in consecutive fields. To compute the packed sequence representing level i -I- 1 we extract the middle 
2iogs-j fjgi(jg Qf each of the 2' subsequences and swap their leftmost and rightmost halves. Each step takes 
0(1) time and hence the algorithm uses time O(logs). To implement Unzip simply we carry out the steps 
in reverse. 

This leads to the following result for general /-packed sequences. 

Lemma 5 For f -packed sequences X and Y of length r we can compute Zip{X,Y) and UNZip(Ar | Y) in 
time log w + 1). 

Proof. Apply the algorithm from the proof of Lemma [3I^i) using the O(logs) implementation of the non- 
standard Zip and Unzip instructions. The time is 0(r/slogs -I- 1) = 0{rf /wlogw). □ 
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6.2 Merging and Sorting 



We review an 0(log s) algorithm for the Merge instruction due to Albers and Hagerup [1] and subsequently 
extend it to an 0(log^ s) algorithm for the Sort instruction. Both results are based on a fast implementation 
of bitonic sorting which we review first. 

6.2.1 Bitonic Sorting 

A sequence of numbers is bitonic if it is the concatenation of a non-decreasing sequence and a non-increasing, 
or if it can be obtained from such a sequence via a cyclic shift. Batcher [7] gave the following recursive 
algorithm to sort a bitonic sequence. Let Z = (zi, . . . , Zs) be a /-packed bitonic sequence. If s = 1 we are 
done. Otherwise, compute and recursively sort the sequences 

Zmin = min(zi,zi+s/2),min(z2, ^2+5/2), • ■ ■ , min(zs/2, Zs) 
Zmax = max(zi,zi+s/2),niax(z2, 2:2+5/2), • ■ • ,niax(zs/2,Zs), 

and return ^max • ^min- For a proof of correctness see e.g. [10, chap. 27]. Note that it suffices to show that 
ATfnin and ATmax are bitonic sequences and that all values in X^i^ and smaller than all values in ATmax- 

Albers and Hagerup [1] gave an O(logs) algorithm using an idea similar to the above algorithm for 
Zip. The algorithm works in logs -I- 1 steps, where each step corresponds to a recursion level. At step i, 
i — 0, . . . , logs, Z consists of 2* bitonic sequences of length 2'°s«-« stored in consecutive fields. To compute 
the packed sequence representing level i -I- 1 we extract the leftmost and rightmost halves of each of 2' bitonic 
sequences, compute their elementwise minimum and maximum, and concatenate the results. Each step takes 
0(1) time and hence the algorithm uses time O(logs). 

6.2.2 Merging 

Let X = {xi, . . . ,Xr) and Y = {yi, . . . ,yr) be sorted /-packed sequences of length s. To implement 
MERGE(Ar, y) we compute the reverse of Y, denoted by Y^ = {yr, ■ ■ ■ ,yi), and then apply the bitonic 
sorting algorithm to Y^ ■ X. Since X and Y are sorted the sequence X ■ Y^ is bitonic and hence the algo- 
rithm returns the sorted sequence of the entries from X and Y. Given Y it is straightforward to compute 
Y^ in O(logs) time using the property that Y^ — (yi+r/2, • • ■ jUr)^' ■ {yi, ■ ■ ■ iVr/'^)^ and by parallelizing 
each recursive level as in the algorithm for Zip and Merge. Hence, the algorithm for Merge uses O(logs) 
time. 

This leads to the following result for general /-packed sequences. 

Lemma 6 (Albers and Hagerup [1]) For f -packed sequences X and Y of length r we can compute 
MERGE(A:,y) in time log w + I). 

Proof. Apply the algorithm from the proof of LemmaO^i) using the 0(log s) implementation of the Merge 
instruction. The time is 0(r/slogs + 1) = 0{rf /wXogw). □ 



6.2.3 Sorting 

Let X — {xi,...,Xg) be a /-packed sequence. We give an 0(log^ s) algorithm for Sort(X). Starting 
from subsequences of length 1 we repeatedly merge subsequences until we have a single sorted sequence. 
The algorithm works in logs + 1 steps. At step i, i = logs, ... ,0, X consists of 2* sorted sequences of 
length 2'°ss-« stored in consecutive fields. Note that the steps here are numbered in decreasing order. To 
compute the packed sequence representing level i — 1 we merge pairs of adjacent sequences, by reversing 
the rightmost one of each pair and sorting the pair with a bitonic sort. At level the reverse and bitonic 
sort takes O(logi) time using the algorithms described above. Hence, the algorithm for Sort(X) uses time 
0(EllVlog*) = 0(log2s). 

This leads to the following result for a general /-packed sequence. 
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Lemma 7 For a f -packed sequence X of length r we can compute Sort(X) in time logr logw + 1). 

Proof. We implement the merge-sort algorithm as in the proof of Lemma [Sjii). Sorting all words takes time 
log^ s + 1) = log^ w + 1) with the 0(log^ s) implementation of the Sort instruction. Each of the 
0(log(r/s)) = O(logr) Merge steps takes time 0{^logw + 1) by LemmalHl In total Sort(X) takes time 
0(^log2w + ^logrlogw + l) = 0(^logrlogw + l). □ 



6.3 Mapping 

We present an O(logs) algorithm for the Map^ instruction. Our algorithm uses a fast algorithm to compact 
packed sequences by Andersson et al. [3] which we review first. 

6.3.1 Compacting 

Let X — {xi, . . . ,Xs) he a, /-packed sequence. We consider field i with test bit ti in X to be vacant if = 1 
and occupied otherwise. If X contains I occupied fields in the compact operation on X returns a /-packed 
sequence C consisting of the occupied fields of X tightly packed in I rightmost fields of A and in the same 
order as they appear in X. Andersson et al. [3, Lemma 6.4] gave an O(logs) algorithm to compact X. The 
algorithm first extracts the test bits and computes their prefix sum in a /-packed sequence P. Thus P{i) 
contains the number of fields X{i) needs to be shifted to the right in the final result. Note that the number 
of vacant positions in P can be up to r and hence we need / > logs. We then move the occupied fields in 
X to their correct position in logs -I- 1 steps. At step i, i = 0, . . . ,logs extract all occupied fields j from X 
such that bit i of P{j) is 1. Move these fields 2' position to the right and insert them back into X. 

The algorithm moves the occupied fields their correct position assuming that no fields "collide" during 
the movement. For a proof of this fact see [20, Section 3.4.3]. Each step of the movement takes constant 
time and hence the total running time is O(logs). Thus, we have the following result. 

Lemma 8 (Andersson et aL [3]) We can compact an f -packed sequence of length s stored in 0(1) words 
in time O(logs). 

6.3.2 Mapping 

Let X = {xi, . . . ,Xs) be a sorted /-packed sequence and let G = ((zi, ^(zi)), . . . , {zs,g(zi))) be a /-packed 
function representing a function such that all entries in X appear in dom(G'). We compute Map'^(G,X) in 
4 steps: 

Step 1: Merge Sequences First construct 2/ + 1-packed sequences X ~ ((xi, 0, 0), . . . , (a;s, 0, 0)) and 
G — {{zi, 1, g(zi)), . . . , (zs, with two zips. The 1-bit subfield in the middle, called the origin bit, is 

for X and 1 for G. 

Compute M = Merge(G,X). Since entries from X and dom(G) appear in the rightmost /-bits of the 
fields in G and X, identical values from X and dom(G) are grouped together in M. We call each such a 
group a chain. Since the entries in dom(G) are unique and all entries in X appears in dom(G), each chain 
contains one entry from G followed by or more entries from X. Furthermore, since the origin bit is 1 
for entries from G and from X, each chain starts with a field from G. Thus, M is the concatenation of 
|dom(F)| = s chains: 

M = Gi • • • G, - ((zi, 1, g(zi)), (zi, 0, 0), . . . , (zi, 0, 0)) • • • ((z„ 1, g(z,)), (z„ 0, 0), . . . , (z„ 0, 0)). 

" V ' " V ' 

or more or more 

All operations in step 1 takes 0(1) time except for Merge that takes 0(log s) time using the algorithm from 
Section [6X1 
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Consider a chain C — {{zj, 1, f{zj)), {zj,0, 0), . . . , {zj,0, 0)) in M with p fields. Each of the p — 1 fields 
(zj, 0, 0), . . . , {zj,0, 0) correspond to p — 1 identical fields from X, and should therefore be replaced by p — 1 
copies of f{zj) in the final result (note that for p = 1, Zj ^ X and therefore f{zj) is not present in the final 
result). The following 3 steps convert C to p— 1 copies of f{zj) as follows. Step 2 removes the leftmost field of 
C. If p = 1, C = ((zj, 1, f{zj))) is completely removed and does not participate further in the computation. 
Otherwise, we are left with C = {{zj,l, f{zj)), {zj,0, 0), . . . , (zj, 0, 0)) with p — 1 > fields. Step 3 computes 
the chain lengths and replaces C with {{p — 1, 1, f{zj)). Finally, step 4 converts this to p— 1 copies of f{zj). 

Step 2: Reduce Chains Extract the origin bits from M into a sequence O, shift O to the right to set 
all entries to right of the start of each chain to be vacant and then compact. The resulting sequence is 
a subsequence of I reduced chains Cjj , . . . ,Ci^ from Ci • • • C^. Note that I is the number of chains of length 
> 1 in Af and therefore the number of unique entries in X. Hence, 

M'=a,--- a, = ((z., , 1, /(z. J), {z,, ,0,0),..., (zi, 0, 0)) • • • {{z,, , l, /(zO), {z^, ,0,0),..., {z,, , 0, 0)). 

" V ' " V ' 

or more or more 

All operations in step 2 takes 0{1) time except for the compact operation that takes O(logs) time by 
Lemma [S] 

Step 3: Compute Chain Lengths Replace the rightmost subentry of each field in by the index of 
the field. To do so unzip the rightmost subentry and zip in the sequence Jrj instead. Set all fields with 
origin bit to be vacant producing a sequence Af* given by 

Af* = l,/(z.J),±__^) • • • (sia,), l,/(zO),±^_^), 

or more or more 

where s(C) is the start index of chain C and _L denotes a vacant field. We compact Af * and unzip the origin 
bits to get a 2/-packed sequence 

5 = ((s(c.j,/(z,j),...,(s(a,),/(^0))- 

The length of Ci^, denoted l{Ci.), is given by l{Cij) = s(Cij^.i) — s{Cij), 1 < j < I. Hence, we can compute 
the lengths for all chains except the by subtracting the rightmost subentries of S from the rightmost 
subentries of S shifted to the right by one field. We compute the length of Ci, as \S\ — s(Ci, ) + 1 and store 
all lengths as the /-packed sequence 

L=((/(C,J,/(z,J),...,(/(CO,/(^0)>- 

As in step 2, all operations in step 3 takes 0(1) time except for the compact operation that takes O(logs) 
time by Lemma [H] 

Step 4: Copy Function Values Expand each field {liCi-), f{zj)) in L to l{Ci-) copies of f{zj). To do 
so we run a reverse version of the compact algorithm that copies fields whenever fields are moved. We copy 
the fields in logs iterations. At iteration h, h — logs, ... ,0 extract all fields j from X such that bit h of the 
right subentry of L{j) is 1. Replicate each of these fields to the 2'* fields to their left. Finally, we unzip the 
rightmost subentry to get the final result. Each of O(logs) iterations take 0(1) time and therefore step 4 
takes O(logs) time. 

Each step of the algorithm for Map^(F, X) uses time O(logs). This leads to the following result for 
general /-packed sequences. 

Lemma 9 For a sorted f -packed sequence X with r entries and a f -packed function G with u entries such 
that all entries in X appear in dom(G') we can compute Map'^(G', X) in time 0{ ^^'^^^'^ logw). 
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Proof. Each of the 0{{r + u)/s) — 0{{r + u)f /w) Map instructions used in the algorithm in the proof of 
Lemma [3] take O(logs) — 0{\ogw) time. In total the algorithm takes time 0{ ^''^'''^ logw). □ 

Plugging the above results in the algorithm for Map from Section [5] we obtain the following result. 

Lemma 10 For a f -packed sequence X with r entries and a f -packed Junction G with u entries such that 
all entries in X appear in dom(G') we can compute Map(G', X) in time logr logw + ^'^"^^•^ log w). 

Proof. The algorithm from Section [5] does a constant number of Sort, Zip, and Unzip operations on 
packed sequences of length r and perform a single Map^ operation on a packed function of length u and a 
sequence of length r. By Lemmas[Sl[71 and[H]this takes time logrlogw + ^''"'^^•^ \ogw). □ 



6.4 Label Nearest Common Ancestor 

We present an 0(log/) algorithm for the Lnca instruction. We first review the relevant features of the 
labeling scheme from Alstrup et al. [2]. 

6.4.1 The Labeling Scheme 

Let T a tree with t nodes. The labeling scheme from Alstrup et al. [2] assigns to each node u in T a unique 
bit string, called a label and denoted label(u), of length 0{\ogt). The label is the concatenation of three 
identical length bit strings: 

label('y) — p{v) ■ h{v) ■ l{v) 

The label p(w), called the part label, is the concatenation of an alternating sequence of variable length bit 
strings called lights parts and heavy parts: 

p{v) — ho ■ li ■ hi ■ ■ ■ Ij ■ hj 

Each heavy and light part in the sequence identify special nodes on the path from the root of T to v. The 
leftmost part, ho, identifies the root. The total number of parts in p{v) and the total length of the parts is 
at most O(logt). For simplicity in our algorithm we use a version of the labeling scheme where the parts are 
constructed using prefix free codes (see remark 2 in Section 5 of Alstrup et al. [2] ) . This implies that if part 
labels p{v) and p{w) agree on the leftmost i — 1 parts, then part i in p{v) is not a prefix of part i in p{w) and 
vice versa. We also prefix all parts in all part labels by a single bit. This increases the minimum length of 
a part to 2 and ensures the longest common prefix of any two parts is at least 1. Since the total number of 
parts in a part label is O(logt) this increases the total length of part labels by at most a factor 2. 

The sublabels b{v) and l{v) identify the boundaries of parts inp{v). The sublabel b{v) has length |p(t;)| + l 
and is 1 at each leftmost position of a light or heavy part in p{v) and 1 at position \p{v)\ + 1. The sublabel 
l{v) has length \p{v)\ and is 1 at each leftmost position of a light part in p{v). The total length of label(w) 
is3\p{v)\ + l = 0{\ogt). 

For our purposes we need to store labels from T in equal length fields in packed sequences. To do so 
compute the length c of the maximum length part label assigned to a node in T. Note that c is an upper 
bound on any sublabel in T. We store all labels in field of length / = 3c bits of the form {p{v) ■ O^"!?"'")! , b{v) ■ 
Oc-l''HI^^(„).0'=-l'('')l), i.e., each sublabel is stored in a subfield of length c aligned to the left of the subfield 
and padded with O's to the right. 

Alstrup et al. [2] showed how to compute lnca of two labels in T. We restate it here in an form suitable 
for our purposes. First we need some definitions. For two bit strings x and y we write x <icx y if and only 
if X precedes y in the lexicographic order on binary strings, that is, a: is a prefix of y or the first bit in which 
X and y differ is in a; and 1 in y. To compute the lexicographic minimum of x and y, denoted minjox we 
can shift the smaller to left align x and y and then compute the numerical minimum. Let x = p{v) and 
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y = p{w) be part labels of nodes v and w. The longest common part prefix of x and y, denoted lcpp(a;, y) is 
the longest common prefix of x and y that ends at a part boundary. The leftmost distinguishing part of x 
with respect to y, denoted Idpj^(x), is the part in x immediately to the right of lcpp(x,y). 

Lemma 11 (Alstrup et al. [2] (Lemma 5)) Let x = p{v) and y — p{w) be part labels of nodes v and w. 
Then, 



p(nca(w, w)) 



{lcpp(a;,?;) if\dpy{x) is a heavy part , 

lcpp(x, y) I (minicx(ldpj^(a;), Idp^(y)) > \\cpp{x,y)\) ifldpy{x) is a light part. 



From the information in the label and Lemma[TT]it is straightforward to compute lnca(a;, y) for any two labels 
x,y stored in 0(1) words in 0(1) time using straightforward bit manipluations. We present an elementwise 
version for packed sequences in the following section. 



6.4.2 Computing Label Nearest Common Ancestor 

Let X and Y be /-packed sequence of length s. We present an 0(log/) algorithm for the Lnca{X,Y) 
instruction. We first need some additional useful operations. Let a; ^ be a bit string. Define lmb(a:) and 
rmb(x) to be the position of the leftmost and rightmost 1 bit of x, respectively. Define 

lsmear(2;) = ol^l-™''^^) • 1™''^^) 
rsmear(2;) = l''"''^^) • qI^I-''^'^^^) 

Thus, lsmear(a;) "smears" the rightmost 1 bit to the right and clears all bits to left. Symmetrically, rsmear(a;) 
smears the leftmost 1 bit to the left and clears all bits to left. We can compute lsmear(a:) in 0(1) time since 
lsmear(2:) — x (B (x — 1) (see e.g. Knuth [18]). Since rsmear(a;) = (lsmear(a:^))'^ and a reverse takes time 
0(log/x/) as described in Section [6.2.2l we can compute rsmear(a:) in time 0(log |a;|). Elementwise versions 
of Ismear and rsmear on /-packed sequences are easy to obtain. Given a /-packed sequence X of length s 
we can compute the elementwise Ismear a.s X Q) {X — Is,/)- We can reverse all fields in time 0(log/) and 
hence we can compute the elementwise rsmear in time 0(log/). 

We compute Lnca{X, Y) as follows. We handle identical pairs of labels first, that is, we extract all fields 
i from X{i) such that X{i) = Y{i) into a sequence U. Since lnca(x,x) = x for any x the labels in L', we 
have that Lnca{X, Y){i) — X{i) for these fields. We handle the remaining fields with different labels using 
the 3 step algorithm below. We then | the result with L' to get the final sequence. 



Step 1: Compute Masks Unzip the //3-packed sequences Xp, Xt, Xi, Yp, Yh, and Yi from X and Y 
corresponding to each of the 3 sublabels. We compute //3-packed sequences of masks Z , M, Mx, and My 
to extract relevant parts from Xp and Yp. The mask are given by 

Z(i) := lsmear(Xp(i) Yp{i)) 

U{i) := rsmcar(Xb(i) & Z{i)) 
Rvii) := lsmear(([/(i) > 1) & Xb{i)) < 1 
Rx{i) := lsmear(([/(i) > l)kYb{i)) < 1 

We explain the contents of the masks in the following. Figure [T] illustrates the computations. The mask Z{i) 
consists of I's in position z = rmh{Xp{i) (BYp{i)) and all positions to the left of z. Since Xp and Yp are distinct 
labels z is the rightmost position where Xp{i) and Yp{i) differ. Since the parts are prefix free encoded and 
prefixed with we have that z is a position within Idpy {Xp{i)) and \dpx (i) 0^p{i)) and it is not the leftmost 
position. Consequently, u — \mh{Xi,{i) & Z{i)) is the leftmost position of Idpy^^,^ {Xp{i)) and ldpXp{i) iYp{i)), 
and therefore the leftmost position to the right of lc-p-p{Xp{i) ,Yp{i)). Hence, U{i) = rsmear(Xf,(i) & Z{i)) 
consists of I's in all positions to the right of lcpp(Xp(z), Yp{i)). This implies that lmb(J7(z) ^ l)&Xfc(i)) is the 
position immediately to the right o{\dpYj,{i) (^p(*)) (if Idpy^^j^ {Xp{i)) is the rightmost part this still holds due 
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Figure 1: Computing hica{Xp{i),Yp{i)). The solid lines in Xp{i) and Yp{i) show part boundaries and the 
dashed lines show boundaries for \cpp{Xp{i) ,Yp{i)) and Idpy^^j^ a and f3 are arbitrary bit strings. 

to the extra bit in Xi, at the rightmost position). Therefore Ryii) '■= lsniear((C/(i) 3> l)&Xb(i)) <C 1 consists 
of I's in all positions of lcpp(Xp(i), yp(i)) and Idpy^^^^ (Xp(i)). Symmetrically, Rx{i) '■= lsmear((C/(i) ^ 
1) ^ 1 consists of I's in all positions of lcpp(Xp(i), yp(i)) and \dY>Xj,{i)(Xp{^))- 

All operations except the elementwise rsmear in the computation of U are straightforward to compute 
in 0(1) time. Hence the time for this step is 0(log/). 

Step 2: Extract Relevants Parts Compute the //3-packed sequences LCPP, LDPy, LDPx, and M 
given by 

LCPP(i) := Xp{i) kTHj) 
LDPy(i) := Xp{i) & Rvii) & U{i) 
LDPx(«) := Yp{i) k Rx{i) & U{i) 
M{i) := min(LDPy(i),LDPx(«)) 

From the definition of the mask in step 1 we have that LCPP(i) = lcpp(Xp(i), Yp{i)). The sequence LDPy (i) 
isX{i) where all but Idpy {Xp{i)) is zeroed and therefore LDPY(i) = Idpy {Xp{i)) |lcpp(Xp(i), 
(see Figure ID). Similarly, LDPx(i> = IdpXp(i) (>^p(j)) > |lcpp(Xp(i), rp(i))|. The parts Idpy^^^^ and 
IdpXp(i) (^(j)) are left aligned in LDPy(i) and LDPx(i) and all other positions are 0. Hence, 

M{i) = min(LDPy(^),LDPx(^)) = min(LDPy(^), LDPx(j)) > \lcpp{Xp{i},Yp{i))\. 

lex 

The time for this step is 0(1). 

Step 3: Construct Labels The part labels are computed as the //3-packed sequence P given by 

^ _ ('LCPP(i) if lsmcar(X6(i) & U{i)) = lsmear(Xi(i) & U{i)) , 

^ |LCPP(i) I M{i} otherwise. 

Recall that U (i) consists of I's at all position in of I's in all positions to the right of lcpp(Xp(i), Yp{i)). Hence 
if lsmear(Xb(i) & U{i)) = lsmear(X;(i) kU{i)) then Idpy^^j^ {Xp{i)) is a light part. By Lemma [TT] it follows 
that P{i) is the part label for lnca,{X {i),Y{i)). To compute P we compare the sequences lsmear(Xb(i)&f7(i)) 
and lsmear(Xi(i) kU{i)), extract fields accordingly from M, and | this with LCPP. The remaining sublabels 
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are constructed by extracting from and Xi using Z. We construct the final /-packed sequence Lnca(X, Y) 
by zipping the sublabels together. 

The time for this step is 0(1). The total time for the algorithm is 0(log/). For general packed sequences 

we have the following result. 

Lemma 12 For f -packed sequences X and Y of length r we can compute Lnca(X, Y) in time log 

Proof. Apply the algorithm from the proof of Lemma [2] using the 0(log/) implementation of Lnca in- 
struction. The time is 0{r/s log / + 1) = log / + 1). □ 



6.5 The Algorithm 

We combine the implementation of Map and Lnca with Lemma [T] to obtain the following result. 

Theorem 4 Approximate string matching for strings Q and P of lengths n and m, respectively, with error 

1 ^ 1 

threshold k can be solved in time 0{nk ■ °g ™ °g ™ _|_ ^i) and space 0{m). 

Proof. We plug in the results for Map and Lnca from Lemmas [T2l and fTOl into the reduction from Lemma[T] 
We have r,u — 0{ra) and / = O(logm) and therefore s = p = 0{m) and q — log r log it; + -^^-tp^ \ogw + 
^log/ + 1) = Q^ miog ^logm _j_ Thus, we obtain an algorithm for approximate string matching using 
space 0(m) and time • " j;^ '"s + n) = 0{nk ■ + n). □ 

Combining Theorems [3] and 3] we have shown Theorem [T] 
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